19 research outputs found

A Graphical Modelling Approach to the Dissection of Highly Correlated Transcription Factor Binding Site Profiles

Author: Audrey Qiuyan Fu (120394)
Boris Adryan (31888)
Robert Stojnic (104577)
Publication date: 01/01/2012

Inferring the combinatorial regulatory code of transcription factors (TFs) from genome-wide TF binding profiles is challenging. A major reason is that TF binding profiles significantly overlap and are therefore highly correlated. Clustered occurrence of multiple TFs at genomic sites may arise from chromatin accessibility and local cooperation between TFs, or binding sites may simply appear clustered if the profiles are generated from diverse cell populations. Overlaps in TF binding profiles may also result from measurements taken at closely related time intervals. It is thus of great interest to distinguish TFs that directly regulate gene expression from those that are indirectly associated with gene expression. Graphical models, in particular Bayesian networks, provide a powerful mathematical framework to infer different types of dependencies. However, existing methods do not perform well when the features (here: TF binding profiles) are highly correlated, when their association with the biological outcome is weak, and when the sample size is small. Here, we develop a novel computational method, the Neighbourhood Consistent PC (NCPC) algorithms, which deal with these scenarios much more effectively than existing methods do. We further present a novel graphical representation, the Direct Dependence Graph (DDGraph), to better display the complex interactions among variables. NCPC and DDGraph can also be applied to other problems involving highly correlated biological features. Both methods are implemented in the R package ddgraph, available as part of Bioconductor (http://bioconductor.org/packages/2.11/bioc/html/ddgraph.html). Applied to real data, our method identified TFs that specify different classes of cis-regulatory modules (CRMs) in Drosophila mesoderm differentiation. Our analysis also found depletion of the early transcription factor Twist binding at the CRMs regulating expression in visceral and somatic muscle cells at later stages, which suggests a CRM-specific repression mechanism that so far has not been characterised for this class of mesodermal CRMs.

Directory of Open Access Journals

PubMed Central

FigShare

DDGraphs for the 5 CRM classes inferred by the NCPC algorithm at α = 0.05.

Author: Audrey Qiuyan Fu (120394)
Boris Adryan (31888)
Robert Stojnic (104577)

Variables in green circles are target variables. Variables in ovals are inferred causal neighbours. Variables in rectangles are inferred to have indirect dependence with the target. Values on the edges are (unadjusted) P-values from conditional independence tests. The same NCPC algorithm with no multiple testing correction was used as in the synthetic data benchmark. See Figure 1 (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002725#pcbi-1002725-g001) for the graphical vocabulary.

FigShare

Comparison of DDGraphs and DAGs.

Author: Audrey Qiuyan Fu (120394)
Boris Adryan (31888)
Robert Stojnic (104577)

(A) The causal neighbourhood of the target variable T consists of variables X1 and X2, while T's Markov blanket consists of X1, X2 and X4 (in ovals). The remaining variables X3 and X5 have indirect dependence (in rectangles). The DDGraph (left) and the DAG (right) represent the same conditional dependencies. The causal neighbourhood, the Markov blanket and the variables in indirect dependence are distinguishable by the variable shapes in the DDGraph, but have to be inferred in the DAG by following the edges. (B) Joint dependency patterns representable in the DDGraph (left) cannot be represented by DAGs (right). The DAG shown here represents the conditional independencies between X1 (or X2) and T given X2 (or X1), but it does not represent the marginal dependency between X1 (or X2) and T. Neither this DAG nor any other DAG can represent the entire joint dependency pattern.

FigShare

Combinatorial patterns of TFs in inferred causal neighbourhoods.

Author: Audrey Qiuyan Fu (120394)
Boris Adryan (31888)
Robert Stojnic (104577)

For each combinatorial pattern we show the number of CRMs with this pattern in the CRM class and in the rest of the CRMs (percentages are given in parentheses). The difference between the two frequencies (CRM class vs rest) and the corresponding P-value are given in the last two columns. P-values were computed from Fisher's exact test for each combination and adjusted for multiple testing using the Benjamini-Hochberg method. See Materials and Methods (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002725#s4) for details. Frequency differences are colour-coded: blue for a decrease in the CRM class, and orange for an increase in the CRM class.
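The testing procedure in this caption (Fisher's exact test per pattern, then Benjamini-Hochberg adjustment) can be sketched in plain Python. This is an illustrative sketch only, not the authors' code: the function names are made up for this example, the two-sided alternative is an assumption, and the counts in the usage lines are hypothetical rather than taken from the paper.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: CRM class / rest of CRMs. Columns: pattern present / absent.
    (The two-sided alternative is an assumption of this sketch.)
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts: (in class with pattern, in class without,
#                       rest with pattern, rest without)
tables = [(12, 28, 20, 250), (5, 35, 40, 230), (2, 38, 15, 255)]
raw = [fisher_exact_two_sided(*t) for t in tables]
print(benjamini_hochberg(raw))
```

Exact integer binomials from `math.comb` keep the hypergeometric probabilities accurate for tables of a few hundred CRMs.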

FigShare

Clustered pairwise correlation matrix of the 15 transcription factor binding profiles over all 310 CRMs.

Author: Audrey Qiuyan Fu (120394)
Boris Adryan (31888)
Robert Stojnic (104577)

Note that the cluster that consists of Mef2 8–12 h and Bin 6–12 h (lower left corner of the matrix) is anti-correlated with early Twi 2–4 h binding.

FigShare

Proportion of correct predictions for the “Time” scenario.

Author: Audrey Qiuyan Fu (120394)
Boris Adryan (31888)
Robert Stojnic (104577)

Each cell shows the mean proportion of correct predictions (with 95% confidence intervals) averaged over 1000 data sets generated in each case. Highest prediction proportions accounting for variation in the data (pairwise t-tests with a cut-off of 0.001 for the P-values) are shown in bold. See Materials and Methods (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002725#s4) for the generation of the synthetic data and for the calculation of the correct prediction proportion.
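One common way to summarise such a cell is a mean with a normal-approximation 95% confidence interval. The caption does not say how the intervals were computed, so the formula below (mean ± 1.96 standard errors) is an assumption, and the proportions in the usage line are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def mean_with_ci(proportions, z=1.96):
    """Mean and normal-approximation confidence interval.

    `proportions` holds one correct-prediction proportion per data set.
    z = 1.96 gives an approximate 95% interval (an assumption of this
    sketch; the source does not state the interval method).
    """
    m = mean(proportions)
    half_width = z * stdev(proportions) / sqrt(len(proportions))
    return m, m - half_width, m + half_width

# Hypothetical proportions from five simulated data sets
print(mean_with_ci([0.5, 0.7, 0.6, 0.8, 0.4]))
```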

FigShare

Two scenarios for generating the synthetic data with correlated variables.

Author: Audrey Qiuyan Fu (120394)
Boris Adryan (31888)
Robert Stojnic (104577)

While the synthetic data were generated for a network of 15 explanatory variables, only variables X1 and X2 have direct dependence with the target variable T, and therefore constitute the causal neighbourhood of T. Variable X3 is included as the confounding variable. (A) The “Time” scenario, in which X1, X2 and X3 correspond to three time points, with stronger correlation between X1 and X2 and between X2 and X3 than between X1 and X3. (B) The “Hidden” scenario, in which X1, X2 and X3 are correlated due to a common cause H in the network. This common cause is used in data generation, but is not available to the algorithms.
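The correlation structure of the “Time” scenario can be illustrated with a simple first-order chain, in which each time point depends only on the previous one. This is a sketch under stated assumptions, not the paper's generator (which is described in its Materials and Methods): the Gaussian variables, the value of `rho`, and the function names are all hypothetical choices for illustration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def simulate_time_chain(n, rho, seed=0):
    """Draw X1 -> X2 -> X3 as a Gaussian chain with lag-1 correlation rho.

    Hypothetical generator: adjacent time points correlate at about rho,
    while X1 and X3 correlate at about rho**2, which is weaker.
    """
    rng = random.Random(seed)
    noise_sd = sqrt(1 - rho ** 2)  # keeps every variable at unit variance
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rho * a + noise_sd * rng.gauss(0, 1) for a in x1]
    x3 = [rho * b + noise_sd * rng.gauss(0, 1) for b in x2]
    return x1, x2, x3

x1, x2, x3 = simulate_time_chain(n=5000, rho=0.8)
print(pearson(x1, x2), pearson(x2, x3), pearson(x1, x3))
```

With this construction the X1–X3 correlation is weaker than either adjacent pair, matching the ordering described in panel (A).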

FigShare

The graphical vocabulary of the DDGraph.

Author: Audrey Qiuyan Fu (120394)
Boris Adryan (31888)
Robert Stojnic (104577)

The vocabulary consists of five types of nodes and two types of edges. For the edges, directed edges ending with dots indicate conditional independencies between Xk and the target variable T given Xi. Undirected edges indicate dependencies, which involve T in different ways, and are also used for conditional independencies between Xi and Xj given T. Consider a case of a non-faithful distribution where T is an XOR function of X1 and X2, with parameters set carefully so that, from the data, X1 and X2 each appear marginally independent of T. In this case, X1 and X2 would be conditionally dependent when conditioning on each other. This distribution would be represented as two dotted nodes with a dotted line between them, but disconnected from T. This kind of graph signals a non-faithful distribution where the neighbourhood and Markov blanket are not defined by traversing undirected edges from T.
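The XOR case can be checked exactly on a small joint distribution. The sketch below assumes X1 and X2 are independent fair coins (one choice of the "carefully set parameters") and reads the conditional dependence as X1 depending on T once X2 is given; the helper names are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# Joint distribution over (x1, x2, t): X1, X2 fair and independent, T = X1 XOR X2
joint = {(x1, x2, x1 ^ x2): Fraction(1, 4) for x1, x2 in product((0, 1), repeat=2)}
NAMES = ("x1", "x2", "t")

def prob(**fixed):
    """Probability that the named variables take the given values."""
    return sum(p for outcome, p in joint.items()
               if all(outcome[NAMES.index(k)] == v for k, v in fixed.items()))

def independent(a, b, given=None):
    """Exact test of independence of a and b, optionally conditional on `given`."""
    conditions = [{}] if given is None else [{given: g} for g in (0, 1)]
    for cond in conditions:
        pz = prob(**cond)
        for va, vb in product((0, 1), repeat=2):
            # P(a, b | z) = P(a | z) P(b | z)  <=>  P(a, b, z) P(z) = P(a, z) P(b, z)
            lhs = prob(**{a: va, b: vb}, **cond) * pz
            rhs = prob(**{a: va}, **cond) * prob(**{b: vb}, **cond)
            if lhs != rhs:
                return False
    return True

print(independent("x1", "t"))              # marginal: X1 vs T
print(independent("x2", "t"))              # marginal: X2 vs T
print(independent("x1", "t", given="x2"))  # conditional on X2
```

Both marginal checks report independence, while the conditional check does not, which is the pattern a method based only on marginal tests would miss.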

FigShare

Dependency of early cardiac enhancer activities on tin.

Author: Angelike Stathopoulos (104580)
Anil Ozdemir (104578)
Boris Adryan (31888)
Hong Jin (65601)
Manfred Frasch (104582)
Robert Stojnic (104577)

Shown are stage 11–12 embryos stained for enhancer activities (anti-βGal or anti-GFP) and Tin (green). (A–K) Enhancer activities in wild type backgrounds (left corner quadrants: anti-Tin omitted for better visualization of reporter patterns; arrow heads: early cardiac expression). (A′–K′) Enhancer activities in homozygous tin346 mutant backgrounds. (A, A′) EgfrE1-LacZ expression in cardiac mesoderm but not in somatic mesoderm (asterisks) requires tin. (B, B′) fzL4-GFP expression in cardioblast progenitors requires tin. (C, C′) High-level HimL47-GFP expression in cardiogenic mesoderm requires tin but somatic mesodermal expression does not. (D, D′) lin-28L64-GFP expression in cardiac mesoderm requires tin. Amnioserosa expression is unaffected in tin mutants. (E, E′) midE19-GFP expression in cardioblast progenitors requires tin. (F, F′) RhoLE102-GFP expression in cardiogenic mesoderm, but not in somatic mesoderm, requires tin. (G, G′) tshL8-LacZ expression in cardiogenic mesoderm, but not in somatic mesoderm, requires tin. (H, H′) tupE9-GFP expression in cardiogenic mesoderm requires tin. (I, I′) unc-5L25-GFP expression in cardiac mesoderm but not in somatic mesoderm requires tin. (J, J′) CG3638L6-GFP expression in cardioblast progenitors requires tin. (K, K′) CG9973E15-GFP expression in cardioblast progenitors requires tin.

FigShare

Dependency of late cardiac enhancer activities within the dorsal vessel on tin.

Author: Angelike Stathopoulos (104580)
Anil Ozdemir (104578)
Boris Adryan (31888)
Hong Jin (65601)
Manfred Frasch (104582)
Robert Stojnic (104577)

Shown are reporter activities (anti-GFP, green), Tin+ cardioblasts and pericardial cells (anti-Tin, red) and Doc+ cardioblasts (anti-Doc, blue) in stage 15–16 control embryos (A–D) and in embryos specifically lacking Tin activity in cardiac cells (tinABD, tin346; A′–D′). (A) midE19-GFP is expressed specifically in the Tin+ cardioblasts. (A′) Absence of cardiac Tin expression causes a severe reduction of midE19-GFP activity. (B) tupE9-GFP is highly expressed in Tin+ cardioblasts (graded posteriorly-to-anteriorly) and, at much lower levels perduring from stage 12 expression, is present in Doc+ cardioblasts, pericardial cells, and dorsal somatic muscles. (B′) Upon loss of cardiac Tin expression almost all cardioblasts contain only low levels of perduring GFP. (C) unc-5L25-GFP expression in pericardial cells and (largely posteriorly) in cardioblasts. (C′) Absence of cardiac Tin expression causes near loss of cardioblast unc-5L25-GFP expression and a reduction of expression in pericardial cells. (D) CG3638L6-GFP is expressed in Tin+ cardioblasts (with variable intensities) and in Tin+ pericardial cells. (D′) Absence of cardiac Tin expression causes nearly complete loss of CG3638L6-GFP expression.

FigShare
